Bounded Regret for Finitely Parameterized Multi-Armed Bandits
Abstract
We consider the problem of finitely parameterized multi-armed bandits, where the model of the underlying stochastic environment can be characterized based on a common unknown parameter. The true parameter is unknown to the learning agent. However, the set of possible parameters, which is finite, is known a priori. We propose an algorithm that is simple and easy to implement, which we call the Finitely Parameterized Upper Confidence Bound (FP-UCB) algorithm, that uses the information about the underlying parameter set for faster learning. In particular, we show that the FP-UCB algorithm achieves bounded regret under some structural condition on the underlying parameter set. We also show that, if the underlying parameter set does not satisfy the necessary structural condition, the FP-UCB algorithm achieves logarithmic regret, but with a smaller preceding constant compared to the standard UCB algorithm. We also validate the superior performance of the FP-UCB algorithm through extensive numerical simulations.
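For orientation, the standard UCB baseline that the abstract compares against can be sketched as follows. This is a minimal sketch of the textbook UCB1 index rule on Bernoulli arms, not the FP-UCB algorithm itself, whose details the abstract does not give. The function name `ucb1`, the Bernoulli reward model, the example arm means, and the exploration bonus `sqrt(2 ln t / n)` are all assumptions made for illustration.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run textbook UCB1 on Bernoulli arms (illustrative baseline only).

    arm_means: assumed true success probabilities, used only to simulate
               rewards and to compute the pseudo-regret.
    Returns (pull counts per arm, cumulative pseudo-regret).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of pulls of each arm
    sums = [0.0] * k      # total observed reward of each arm
    best_mean = max(arm_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            # Initialization: pull every arm once.
            arm = t - 1
        else:
            # Pick the arm with the largest empirical mean plus bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - arm_means[arm]

    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.8], horizon=2000)
    print("pulls per arm:", counts)
    print("cumulative pseudo-regret:", regret)
```

Under this baseline the pseudo-regret grows logarithmically in the horizon; the abstract's claim is that knowing the finite parameter set lets FP-UCB improve on this, to bounded regret when the structural condition holds.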
Similar resources
Bounded regret in stochastic multi-armed bandits
We study the stochastic multi-armed bandit problem when one knows the value μ(⋆) of an optimal arm, as well as a positive lower bound on the smallest positive gap ∆. We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows ∆, and bound...
Bounded Regret for Finite-Armed Structured Bandits
We study a new type of K-armed bandit problem where the expected return of one arm may depend on the returns of other arms. We present a new algorithm for this general class of problems and show that under certain circumstances it is possible to achieve finite expected cumulative regret. We also give problem-dependent lower bounds on the cumulative regret showing that at least in special cases t...
Simple regret for infinitely many armed bandits
We consider a stochastic bandit problem with infinitely many arms. In this setting, the learner has no chance of trying all the arms even once and has to dedicate its limited number of samples only to a certain number of arms. All previous algorithms for this setting were designed for minimizing the cumulative regret of the learner. In this paper, we propose an algorithm aiming at minimizing th...
Pure Exploration in Finitely-Armed and Continuously-Armed Bandits
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. A forecaster is assessed in terms of its simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the ...
Pure exploration in finitely-armed and continuous-armed bandits
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast...
Journal
Journal title: IEEE Control Systems Letters
Year: 2021
ISSN: 2475-1456
DOI: https://doi.org/10.1109/lcsys.2020.3008798